The COVID-19 pandemic has reached almost every country during 2020. As an effect of extreme social-distancing measures adopted by governments to fight an increasing sanitary crisis, an unprecedent shutdown of demand and supply chains drove world economies to recession.
In order to quantify impact COVID-19, this study will use IMF estimated figures for GDP real growth for over 180 countries in 2020. It is known that this indicator is normally affected by a multitude of national and international factors, but the magnitude of the recession caused by the pandemic, as well as its global reach, makes it an adequate response variable to represent impact.
The data to be explored includes 20 factors extracted from ourworldindata.com, varying from general demographic figures to more COVID-related statistics. A database with 8 additional features from an IMF research is also used to include information on fiscal and monetary stimulus announced by several countries in response to COVID.
# Check all features to be used as predictors
Features[-c(1,2)]
## [1] "total_cases_per_million"
## [2] "total_deaths_per_million"
## [3] "icu_patients_per_million"
## [4] "hosp_patients_per_million"
## [5] "total_tests_per_thousand"
## [6] "positive_rate"
## [7] "total_vaccinations_per_hundred"
## [8] "people_fully_vaccinated_per_hundred"
## [9] "stringency_index"
## [10] "population_density"
## [11] "median_age"
## [12] "gdp_per_capita"
## [13] "extreme_poverty"
## [14] "cardiovasc_death_rate"
## [15] "diabetes_prevalence"
## [16] "female_smokers"
## [17] "male_smokers"
## [18] "hospital_beds_per_thousand"
## [19] "life_expectancy"
## [20] "human_development_index"
## [21] "Fiscal.additional.subtotal"
## [22] "Fiscal.health"
## [23] "Fiscal.non.health"
## [24] "Fiscal.accelerated.or.deferredrev"
## [25] "Liquidity.subtotal"
## [26] "Liquidity.below.the.line"
## [27] "Liquidity.guarantees"
## [28] "Liquidity.quasi.fiscal"
After proper cleaning and structuring, different statistical methods are applied to data and results are analyzed. For practical purposes, this paper presents results of each step as it describes methodology and justifies the use of a particular approach.
Immunization programs, despite crucial for understanding the most recent scenario, only started in the second half of December and thus its data cannot be used to model GDP real growth in 2020. For this reason, vaccination figures are explored separately and modelled with the use of a different statistical method.
The major challenge of dealing with the available data is the large number of missing data, for both country figures (rows) and features (columns). Although some advanced machine learning algorithms can fit a regression model with missing data, linear correlations cannot be analyzed when that is the case.
For this reason, the database is modified through the following steps:
Notes:
Imputed database will be used for correlation analysis and linear model fitting only. Modelling with other techniques will keep original table with missing data (see next).
Data is partitioned into train and test (validation) datasets.
# Check number of countries left in data
nrow(impTrain)
## [1] 127
# Check number of features left in data
ncol(impTrain)
## [1] 20
# Overview of train data after cleaning and imputation
Data[index,1] %>% cbind(impTrain) %>% head(5)
## . total_cases_per_million total_deaths_per_million
## 159 Zimbabwe 932.993 24.42300
## 14 Belize 27101.184 623.71000
## 50 Eswatini 8066.101 176.69900
## 118 Portugal 40569.764 677.27700
## 43 Dominica 1222.375 76.25451
## positive_rate stringency_index population_density median_age
## 159 0.09500 74.96856 42.729 19.600
## 14 0.23897 72.13777 16.426 25.000
## 50 0.11502 69.60810 79.492 21.500
## 118 0.12600 68.05097 112.371 46.200
## 43 0.08183 43.46094 98.567 30.555
## gdp_per_capita extreme_poverty cardiovasc_death_rate
## 159 1899.775 21.400 307.846
## 14 7824.362 5.636 176.957
## 50 7738.975 16.473 333.436
## 118 27936.896 0.500 127.842
## 43 9673.367 5.541 227.376
## diabetes_prevalence female_smokers male_smokers
## 159 1.82 1.600 30.700
## 14 17.11 7.221 25.251
## 50 3.94 1.700 16.500
## 118 9.85 16.300 30.000
## 43 11.62 2.830 31.114
## hospital_beds_per_thousand life_expectancy human_development_index
## 159 1.70 61.49 0.571
## 14 1.30 74.62 0.716
## 50 2.10 60.19 0.611
## 118 3.39 82.05 0.864
## 43 3.80 75.00 0.742
## Fiscal.additional.subtotal Fiscal.health Fiscal.non.health
## 159 4.8015925 0.1400995 4.661493
## 14 0.7497281 0.5086054 1.329882
## 50 2.8361923 0.3712294 2.464963
## 118 4.7257658 1.0943879 3.631378
## 43 1.9141434 0.0000000 1.914143
## Liquidity.subtotal y
## 159 1.435653 -8.0
## 14 3.298804 -14.1
## 50 1.627220 -3.3
## 118 6.466837 -7.6
## 43 0.708942 -10.4
Feature anaysis and modelling will filter out predictors with weak impact on response variable (GDP real growth).
A table is created with Pearson correlation values between features and response variable, along with p-values of a correlation test. It is important to mention that these results do not imply causation, but simply a relationship or pattern between the values.
Results:
# Check correlation to response variable
Correlation
## Feature Correlation Pvalue
## 2 diabetes_prevalence -0.352922233 0.00004708259
## 3 extreme_poverty 0.346623848 0.00006548532
## 10 human_development_index -0.306317930 0.00046094598
## 19 total_deaths_per_million -0.289934166 0.00094501749
## 11 life_expectancy -0.274039606 0.00182303715
## 14 median_age -0.273703098 0.00184781600
## 12 Liquidity.subtotal -0.259828791 0.00317762548
## 7 Fiscal.non.health -0.214229114 0.01558389940
## 17 stringency_index -0.194875963 0.02812626336
## 5 Fiscal.additional.subtotal -0.194398704 0.02852142883
## 18 total_cases_per_million -0.188435857 0.03387026739
## 9 hospital_beds_per_thousand -0.139501030 0.11776788190
## 4 female_smokers -0.139190644 0.11859527173
## 1 cardiovasc_death_rate 0.122074641 0.17155369905
## 15 population_density -0.079802210 0.37246973129
## 16 positive_rate -0.078655798 0.37939886436
## 8 gdp_per_capita -0.069258868 0.43909618951
## 13 male_smokers -0.057771625 0.51882668429
## 6 Fiscal.health 0.003751348 0.96661214473
## 20 y 1.000000000 NA
Note that 11 out of 19 predictors have a p-value lower than 0.05 in a correlation test, thus presenting sufficient evidence to conclude that there is a significant linear relationship (with alpha = 0.05).
Keep in mind these correlation coefficients are calculated with imputed data, which can affect results. Another limitation of this analysis is that it does not account for non-linear relationships.
The Xgboost is a gradient-boosting algorithm often used in machine learning solutions. The advantages of applying this method in variable selection include the flexibility of modelling with missing data, as well as the fact that it is able to handle co-dependence between predictors - while linear models are not.
A xgboost model is fitted on train data, and parameters are adjusted in order to minimize error observed after prediction in test data. The error metric used in this analysis is the root mean squared error (RMSE).
# Check RMSE of test data
sqrt(mean((Test$y-Pred)^2))
## [1] 5.932597
Correlation results (Correlation and Pvalue) are complemented with xgboost outputs for variable importance in a single table. The column NAs indicates the percentage of missing records in training data, while VarImp and ImpAboveMean are related to the improvement in accuracy brought by each feature in the xboost model.
Results:
# Compare variable importance
FeaturesImp
## Feature Correlation Pvalue NAs VarImp ImpAboveMean
## 2 diabetes_prevalence -0.35 0.00 0.00 0.10 TRUE
## 3 extreme_poverty 0.35 0.00 0.28 0.05 FALSE
## 10 human_development_index -0.31 0.00 0.00 0.24 TRUE
## 19 total_deaths_per_million -0.29 0.00 0.06 0.09 TRUE
## 11 life_expectancy -0.27 0.00 0.00 0.00 FALSE
## 14 median_age -0.27 0.00 0.01 0.02 FALSE
## 12 Liquidity.subtotal -0.26 0.00 0.36 0.00 FALSE
## 7 Fiscal.non.health -0.21 0.02 0.11 0.04 FALSE
## 17 stringency_index -0.19 0.03 0.00 0.01 FALSE
## 5 Fiscal.additional.subtotal -0.19 0.03 0.04 0.01 FALSE
## 18 total_cases_per_million -0.19 0.03 0.01 0.06 TRUE
## 9 hospital_beds_per_thousand -0.14 0.12 0.09 0.02 FALSE
## 4 female_smokers -0.14 0.12 0.16 0.08 TRUE
## 1 cardiovasc_death_rate 0.12 0.17 0.01 0.02 FALSE
## 15 population_density -0.08 0.37 0.01 0.02 FALSE
## 16 positive_rate -0.08 0.38 0.36 0.04 FALSE
## 8 gdp_per_capita -0.07 0.44 0.00 0.09 TRUE
## 13 male_smokers -0.06 0.52 0.17 0.06 TRUE
## 6 Fiscal.health 0.00 0.97 0.11 0.05 FALSE
Among all 11 predicts with a p-value lower than 0.05 in the correlation test, only 4 are above average in variable importance to the xgboost model. Note that stringency index, which is a metric directly related to pandemic response by governments, has no missing values and a p-value of 0.03. Its VarImp, however, is one of the lowest in record.
The next step is to eliminate features with weak results. Only variables with a p-value lower than alpha (0.05) or with above-average VarImp are selected:
# Check features left in data after selection
newTrain %>% names()
## [1] "total_cases_per_million" "total_deaths_per_million"
## [3] "stringency_index" "median_age"
## [5] "gdp_per_capita" "extreme_poverty"
## [7] "diabetes_prevalence" "female_smokers"
## [9] "male_smokers" "life_expectancy"
## [11] "human_development_index" "Fiscal.additional.subtotal"
## [13] "Fiscal.non.health" "Liquidity.subtotal"
## [15] "y"
If a linear model is to be considered, the issue of collinearity among predictors must be addressed. A correlation table summarizes these relationships found in data, indicating correlation coefficient and the p-value status for each case (p < .0001 ‘* * * *’; p < .001 ‘* * *’, p < .01 ‘* *’, p < .05 ‘*’).
Results:
# Show feature correlation table
FeatCorrelation
## total_cases_per_million
## total_cases_per_million
## total_deaths_per_million 0.81****
## stringency_index 0.28**
## median_age 0.64****
## gdp_per_capita 0.61****
## extreme_poverty -0.47****
## diabetes_prevalence 0.02
## female_smokers 0.64****
## male_smokers 0.05
## life_expectancy 0.58****
## human_development_index 0.64****
## Fiscal.additional.subtotal 0.24**
## Fiscal.non.health 0.24**
## Liquidity.subtotal 0.25**
## total_deaths_per_million stringency_index
## total_cases_per_million
## total_deaths_per_million
## stringency_index 0.30***
## median_age 0.59**** 0.15
## gdp_per_capita 0.38**** 0.13
## extreme_poverty -0.43**** -0.29***
## diabetes_prevalence -0.07 0.05
## female_smokers 0.68**** 0.06
## male_smokers 0.02 -0.03
## life_expectancy 0.52**** 0.19*
## human_development_index 0.56**** 0.23**
## Fiscal.additional.subtotal 0.30*** -0.01
## Fiscal.non.health 0.28** 0.01
## Liquidity.subtotal 0.33*** -0.04
## median_age gdp_per_capita extreme_poverty
## total_cases_per_million
## total_deaths_per_million
## stringency_index
## median_age
## gdp_per_capita 0.67****
## extreme_poverty -0.75**** -0.53****
## diabetes_prevalence 0.11 0.20* -0.32***
## female_smokers 0.70**** 0.47**** -0.38****
## male_smokers 0.14 -0.10 -0.18*
## life_expectancy 0.87**** 0.71**** -0.78****
## human_development_index 0.92**** 0.78**** -0.81****
## Fiscal.additional.subtotal 0.49**** 0.35**** -0.30***
## Fiscal.non.health 0.53**** 0.40**** -0.35****
## Liquidity.subtotal 0.49**** 0.31*** -0.28**
## diabetes_prevalence female_smokers male_smokers
## total_cases_per_million
## total_deaths_per_million
## stringency_index
## median_age
## gdp_per_capita
## extreme_poverty
## diabetes_prevalence
## female_smokers -0.14
## male_smokers 0.14 0.14
## life_expectancy 0.25** 0.55**** 0.07
## human_development_index 0.20* 0.62**** 0.08
## Fiscal.additional.subtotal -0.06 0.43**** 0.09
## Fiscal.non.health -0.02 0.45**** 0.04
## Liquidity.subtotal 0.07 0.33*** 0.01
## life_expectancy human_development_index
## total_cases_per_million
## total_deaths_per_million
## stringency_index
## median_age
## gdp_per_capita
## extreme_poverty
## diabetes_prevalence
## female_smokers
## male_smokers
## life_expectancy
## human_development_index 0.93****
## Fiscal.additional.subtotal 0.45**** 0.49****
## Fiscal.non.health 0.49**** 0.54****
## Liquidity.subtotal 0.42**** 0.41****
## Fiscal.additional.subtotal Fiscal.non.health
## total_cases_per_million
## total_deaths_per_million
## stringency_index
## median_age
## gdp_per_capita
## extreme_poverty
## diabetes_prevalence
## female_smokers
## male_smokers
## life_expectancy
## human_development_index
## Fiscal.additional.subtotal
## Fiscal.non.health 0.97****
## Liquidity.subtotal 0.41**** 0.42****
Principal component analysis is one of the appropriate methods of solving the collinearity among variables. It is applied to 4 groups of predictors and only the first component (PC1) of each group will be used in final database:
At this point data had a decrease in features from 28 to 6:
# Check features in final data
newTrain %>% names()
## [1] "stringency_index" "diabetes_prevalence" "y"
## [4] "Demographics" "Cases_and_deaths" "Stimulus"
## [7] "Smokers"
The main goal of fitting a linear model is not to generate predictions, but to analyze its outputs:
Results:
# Show linear model output
summary(LinearModel)
##
## Call:
## lm(formula = y ~ ., data = newTrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.0074 -2.0151 0.0972 2.2746 10.6906
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.85097841 1.66502280 1.112 0.268497
## stringency_index -0.05282832 0.02733710 -1.932 0.055657 .
## diabetes_prevalence -0.41628165 0.08563977 -4.861 0.00000358 ***
## Demographics 0.00006253 0.00002251 2.778 0.006353 **
## Cases_and_deaths -0.00005661 0.00002588 -2.187 0.030673 *
## Stimulus 0.21890758 0.05775312 3.790 0.000237 ***
## Smokers 0.01389030 0.02813462 0.494 0.622415
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.823 on 120 degrees of freedom
## Multiple R-squared: 0.2803, Adjusted R-squared: 0.2443
## F-statistic: 7.79 on 6 and 120 DF, p-value: 0.0000004338
As indicated by the (adjusted) R-squared metric, this model is only able to explain 24.6% of the variance of GDP real growth of 2020. Despite the low value, fitting a linear model had the main goal of understanding the relationships between factors and response variable.
Smokers is the only variable for which its beta is not significant at an alpha of 0.1. It is eliminated from the database and another model is fitted:
# Show second linear model output
summary(LinearModel)
##
## Call:
## lm(formula = y ~ ., data = newTrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.0047 -2.0270 -0.0081 2.2734 10.8744
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.85026227 1.65981081 1.115 0.267170
## stringency_index -0.05374555 0.02718853 -1.977 0.050342 .
## diabetes_prevalence -0.40941825 0.08423943 -4.860 0.00000356 ***
## Demographics 0.00006028 0.00002198 2.743 0.007012 **
## Cases_and_deaths -0.00005367 0.00002511 -2.137 0.034580 *
## Stimulus 0.21564104 0.05719331 3.770 0.000254 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.811 on 121 degrees of freedom
## Multiple R-squared: 0.2788, Adjusted R-squared: 0.249
## F-statistic: 9.357 on 5 and 121 DF, p-value: 0.0000001473
Pred <- predict(LinearModel, newTest)
sqrt(mean((newTest$y-Pred)^2))
## [1] 5.403634
The second model has all coefficients with p-value lower than 0.1 and also shows minor improvement on adjusted R-squared.
As noticed in the coefficient table, all but 2 variables (Demographics and Stimulus) have opposite impact on GDP real growth, that is, response variable tend to increase if they decrease.
Note: RMSE for test data is slightly better than that of the Xgboost model (5.932597).
Vaccination could not be used in modelling since it has only started in the second half of December 2020. Despite of that, data of countries already advanced in the process of immunization is available and can be used to further analyze how it can impact other factors.
Stringency index, a statistically significant coefficient in our linear model (alpha = 0.1), is expected to drop as more people become immune to the virus. It is possible to validate this hypothesis by using VAR/VECM time series modelling, as a cointegration between stringency and vaccination figures could provide favorable evidence.
Israel, a country where oved 60% of its population is already vaccinated, will be our case for this analysis. From its data, 2 time-series are selected: total vaccination and stringency index, starting from December 19th 2020.
# Overview of time series
Data[,c(1,9)] %>% head(5)
## total_vaccinations stringency_index
## 2020-12-19 07:00:00 61 69.71143
## 2020-12-20 07:00:00 7433 70.50571
## 2020-12-21 07:00:00 32304 71.30000
## 2020-12-22 07:00:00 76914 71.69571
## 2020-12-23 07:00:00 139728 72.09143
Stringency index is modified to better show an eventual relationship to total vaccinations: data is smoothed with a one week simple moving average and inverted.
# Compare time series
plot.ts(CompareTimeSeries)
Inverted stringency index apparently follows total vaccination as it increases.
This relationship between vaccination and stringency index is time-lagged, as one time series must be shifted in time to present correlation to another.
The selection of an optimal lag length to be used in a VAR model estimation can be done in different ways. Plotting a cross correlation function (CCF) chart of the log-transformed time series is useful for a visual approach in determining best lag order:
# Select best lag
ccf(VaccinationTransformed,StringencyTransformed,lag.max = 10)
The blue dotted lines represent limits for a statistically significant correlation. In this case, lags from -7 to -10 periods in total vaccination show relationship to inverted stringency index.
Another method of selecting optimal lag is to calculate different metrics used as criteria. Final Prediction Error (FPE) Akaike (AIC), Hannah-Quinn (HQ) and Schwarz information criteria (SC) are commonly used methodologies for this procedure.
# Check best lag order
vModel
## $selection
## AIC(n) HQ(n) SC(n) FPE(n)
## 8 8 8 8
##
## $criteria
## 1 2 3 4 5 6
## AIC(n) 3.414806 3.454960 3.375105 3.417890 3.444158 3.254624
## HQ(n) 3.479591 3.562935 3.526269 3.612243 3.681701 3.535357
## SC(n) 3.575078 3.722080 3.749072 3.898705 4.031821 3.949135
## FPE(n) 30.412296 31.663009 29.242512 30.538683 31.380281 25.996764
## 7 8 9 10
## AIC(n) 3.312524 2.587262 2.630491 2.670644
## HQ(n) 3.636446 2.954374 3.040792 3.124136
## SC(n) 4.113883 3.495469 3.645545 3.792547
## FPE(n) 27.596418 13.394025 14.028312 14.658435
The lag order that minimises each of the criteria indicates the optimal value for estimating a VAR model. Note that all of the 4 methods indicates 8 as optimal, which is used for fitting.
A cointegration test is used to establish if there is a correlation between vaccination and stringency index in the long term. The Johansen test (or procedure) is used to analyze cointegration relationship among these variables:
# Test for cointegration
summary(JTest)
##
## ######################
## # Johansen-Procedure #
## ######################
##
## Test type: trace statistic , with linear trend
##
## Eigenvalues (lambda):
## [1] 0.14314925 0.04153513
##
## Values of teststatistic and critical values of test:
##
## test 10pct 5pct 1pct
## r <= 1 | 4.20 6.50 8.18 11.65
## r = 0 | 19.49 15.66 17.95 23.52
##
## Eigenvectors, normalised to first column:
## (These are the cointegration relations)
##
## TotalVaccination.l8 InverseStringencyIndex.l8
## TotalVaccination.l8 1.0000000 1.00000
## InverseStringencyIndex.l8 0.1477296 -2.62168
##
## Weights W:
## (This is the loading matrix)
##
## TotalVaccination.l8 InverseStringencyIndex.l8
## TotalVaccination.d -0.011258817 0.002142991
## InverseStringencyIndex.d 0.001806694 0.005931219
For r = 0, test statistic (19.49) is higher than the 5% significance level, presenting evidence favorable to cointegration at a lag order of 8.
This paper aimed at understanding which factors have heavier influence in 2020 GDP real growth, considering this variable as a representation of COVID-19 impact. Quantitative analysis of features was performed along the study and provided valuable insights on how they relate to response variable and among them.
The use of principal component analysis helped to group correlated predictor into groups of similar contexts. Variables connected to fiscal and monetary stimulus were put together in a single predictor, which came to be significant in our final model and showed direct linear relationship to GPD real growth.
Stringency index, as well as cases and deaths, are other variables in our model directly related to COVID-19 - as opposed to demographics and diabetes prevalence, for example. For countries where the pandemic is still peaking, such as Brazil, stimulus is still an option to respond to COVID-19 impact on the economy. National or regional restrictions may also impact stringency index and number of cases and deaths.
In this scenario, a practical application of the fitted model is to estimate the effect of these factors on the next yearly growth output (2021).
As immunization programs advance, it is expected that governments will start easing social-distancing restrictions. As shown in the Israeli case, there is evidence of cointegration between total vaccinations and stringency index at a lag order of 8 - implying that a change in vaccinated population may have an impact on stringency shortly after one week.
This timeframe cannot be generalized to other countries, especially due to the diverse effectiveness of vaccine products and different approaches to social-distancing restrictions. However, it offers a hint on how vaccination drives to a better context for economic reopening.